Standard Random Forest

Author

David Bramwell

Published

March 14, 2023

1 Background

This short report examines the standard Random Forest in the Manus Models.py code. By default the code performs a train/test split on the trainingData.csv file. I do not know the provenance of this file.

        training_data = self.__read_local_file_into_dataframe("Analytics/trainingData.csv")

        X = training_data[self.__features]
        y = training_data['Diagnosis']

        self.target_population = "PD vs Not PD"

        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=392)

        model = RandomForestClassifier(n_estimators=100, random_state=1)
        model.fit(x_train, y_train)

The training procedure, and hence the Random Forest classifier produced, should be identical to that in the main code.

The test split is never used in Models.py, so I used it here as a ‘blind’ test set to assess performance.

I also dump out all of the trees in the forest.

Important

The value of this blind analysis depends on the provenance of Analytics/trainingData.csv.

The initial backend commit is from June 2021. The file came from the original DiPar work, but the current training file appears to include subjects from the Walker study, so any test against 76patients_21HC is (potentially) biased. I cannot find the original model files to load the model derived from the DiPar data alone (a comment in the code says it should be in the same folder; it isn’t).

The actual training files do not carry subject IDs, so it is hard to be definitive about the independence of the data. The blind plots do, however, show an encouraging, more ‘real-world’ spread of probabilities.

dataset_details109.xlsx does have subject labels, but its numbers do not match trainingData.csv, which contains 132 unlabelled FE data entries of 123 values each. trainingData.csv comes from one of the first backend commits (Martin, 25 Jun 2021: “#4 - Initial commit of the backend functions”).

The Models.py code actually performs a 70:30 train:test split on the data but then ignores the test portion. Evaluating on it yields a ‘blind’ test accuracy of 0.8 and ROC AUC of 0.875 (as expected, the train accuracy and ROC AUC are both perfect, i.e. 1.0).
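The train-then-evaluate pattern described above can be sketched as follows. This is illustrative only: synthetic data stands in for trainingData.csv, while the split and model parameters mirror those quoted from Models.py.

```python
# Sketch of evaluating the otherwise-ignored 30% test split.
# Synthetic data is a stand-in for trainingData.csv (132 rows);
# split/model parameters mirror the report's code.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=132, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=392)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(x_train, y_train)

# Train metrics: expected to be (near-)perfect for fully grown trees.
train_acc = accuracy_score(y_train, model.predict(x_train))
# Test metrics: the 'blind' estimate of generalisation.
test_acc = accuracy_score(y_test, model.predict(x_test))
test_auc = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
print(train_acc, test_acc, test_auc)
```

The near-perfect training score is a reminder that only the held-out split says anything about generalisation.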

2 Feature Importance

This section applies sklearn’s standard tools for assessing feature importance in tree ensembles.

The permuted importances are mostly zero, which suggests a great deal of multicollinearity.

Show the code
# Impurity-based importances (left) vs permutation importances (right).
# The permutation_importance parameters here follow sklearn's example.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.inspection import permutation_importance

result = permutation_importance(RF_model, x_train, y_train, n_repeats=10, random_state=42)
perm_sorted_idx = result.importances_mean.argsort()

tree_importance_sorted_idx = np.argsort(RF_model.feature_importances_)
tree_indices = np.arange(0, len(RF_model.feature_importances_)) + 0.5

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(24, 16))
_ = ax1.barh(tree_indices, RF_model.feature_importances_[tree_importance_sorted_idx], height=0.7)
_ = ax1.set_yticks(tree_indices)
_ = ax1.set_yticklabels(feature_importance.keys())
_ = ax1.set_ylim((0, len(RF_model.feature_importances_)))
_ = ax2.boxplot(
    result.importances[perm_sorted_idx].T,
    vert=False,
    labels=feature_importance.keys(),
)
fig.tight_layout()
plt.show()

Permuted Importance:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Importance measured from the model fit (impurity-based):
circle_tremorCoordAmpMean 0.038023
zigzagCopy_tremorPcX 0.0316448
zigzagTrace_tremorAccAmpCoV 0.0236195
zigzagCopy_tremorPcY 0.0228122
elelFb_eAspectMean 0.0215612
elelFb_eRleftTrenddiff 0.0204285
spiral_timePerSpiralCoV 0.020219
elelFb_eWidthKurt 0.0195069
elelFb_lDurationTrenddiff 0.0174152
elelFb_eRtopTrenddiff 0.0172406
circle_timePerCircleMean 0.0168828
zigzagCopy_tremorGyroAmpCoV 0.016683
zigzagTrace_meanError 0.0165201
elelFb_lAspectMedian 0.0163189
circle_tremorAccAmpCoV 0.0149563
elelFb_eRrightStd 0.0147782
elelFb_eAspectMedian 0.0136633
zigzagCopy_timePerZigzagMean 0.0133687
elelFb_eWidthStdnorm 0.0125117
elelFb_lHeightStd 0.012426
elelFb_eRbotStdnorm 0.0114467
circle_tremorAccAmpMean 0.0112271
fittsCloseSmall_meanTouchScore 0.0111679
zigzagTrace_tremorPcRelVar 0.0111381
circle_tremorGyroAmpCoV 0.0108665
elelFb_eAspectTrenddiff 0.0107928
zigzagCopy_timePerZigzagCoV 0.0103392
circle_tremorFreq 0.0103004
zigzagCopy_tremorCoordAmpSlope 0.0102683
elelFb_lRrightKurt 0.0100347
zigzagTrace_tremorAccAmpMean 0.00990722
elelFb_lRbotKurt 0.00974684
zigzagCopy_tremorRelPower 0.00954921
elelFb_lDurationStd 0.00954488
elelFb_lSlantStdnorm 0.00912885
spiral_tremorPcY 0.00910023
elelFb_eSlantKurt 0.00861384
zigzagTrace_timePerZigzagMean 0.00853098
zigzagCopy_zigzagHeightMean 0.0084871
circle_tremorRelPower 0.00843968
elelFb_lRtopTrenddiff 0.00814985
elelFb_lAspectQuartdeltanorm 0.00807028
spiral_tremorGyroAmpMean 0.00806128
zigzagTrace_timePerZigzagSlope 0.00804289
elelFb_eSlantStd 0.00781069
elelFb_eRrightTrenddiff 0.00777269
elelFb_lWidthTrenddiff 0.00774431
spiral_timePerSpiralSlope 0.00771839
circle_tremorGyroAmpSlope 0.00763363
elelFb_lHeightQuartdeltanorm 0.00757614
elelFb_lRtopStd 0.00749684
fittsFarSmall_meanTouchScore 0.00747145
zigzagTrace_timePerZigzagCoV 0.00738818
circle_tremorAccAmpSlope 0.00738239
elelFb_lWidthMedian 0.00731198
zigzagCopy_timePerZigzagSlope 0.00698731
elelFb_eDurationQuartdeltanorm 0.0069314
elelFb_lHeightMedian 0.0068871
zigzagCopy_tremorAccAmpSlope 0.00678055
elelFb_lWidthStdnorm 0.0066614
elelFb_eDurationMedian 0.00663731
elelFb_lRtopQuartdelta 0.00643876
elelFb_lRrightMean 0.00643656
elelFb_eHeightQuartdeltanorm 0.00641194
circle_timePerCircleSlope 0.00636141
spiral_tremorCoordAmpCoV 0.006294
spiral_tremorCoordAmpSlope 0.00612302
zigzagCopy_tremorGyroAmpSlope 0.00599414
circle_tremorPcRelVar 0.00595
elelFb_lRbotQuartdelta 0.00585371
elelFb_lWidthQuartdeltanorm 0.0056237
elelFb_lRleftMean 0.00554411
elelFb_eAspectKurt 0.00551906
elelFb_eDurationStd 0.00542867
elelFb_eWidthMean 0.00537685
elelFb_eRrightMedian 0.00530436
elelFb_eDurationMean 0.00526872
elelFb_eRtopMedian 0.00526639
elelFb_lHeightQuartdelta 0.00522019
elelFb_lRbotStdnorm 0.00521638
circle_tremorCoordAmpCoV 0.00510235
elelFb_eHeightStd 0.00501626
elelFb_eRbotQuartdelta 0.0050029
elelFb_eSlantQuartdeltanorm 0.0049788
elelFb_eDurationStdnorm 0.0049375
elelFb_eSlantMean 0.00491982
elelFb_lRtopKurt 0.00487239
elelFb_eAspectStdnorm 0.00482574
spiral_tremorAccAmpCoV 0.00472211
spiral_meanError 0.00460279
elelFb_lRrightQuartdelta 0.00459938
elelFb_lAspectStdnorm 0.00455498
elelFb_eWidthQuartdelta 0.0045363
elelFb_eRtopMean 0.00452696
spiral_tremorRelPower 0.00451917
elelFb_lSlantQuartdeltanorm 0.0044299
elelFb_lHeightTrenddiff 0.00431181
spiral_tremorPcX 0.0042603
elelFb_lHeightStdnorm 0.00419021
elelFb_eRbotKurt 0.00403224
elelFb_lSlantMean 0.00402205
elelFb_lSlantTrenddiff 0.00400052
elelFb_lWidthKurt 0.00375997
spiral_tremorFreq 0.00372458
elelFb_lDurationKurt 0.00357136
zigzagTrace_tremorAccAmpSlope 0.00354527
elelFb_lRtopMean 0.00346917
elelFb_lRleftMedian 0.00338174
elelFb_lWidthTrendratio 0.00325557
elelFb_eWidthQuartdeltanorm 0.00306915
elelFb_eAspectQuartdelta 0.0027863
elelFb_lHeightTrendratio 0.00272977
elelFb_eRtopKurt 0.00271944
elelFb_lRrightStd 0.00259676
fittsCloseSmall_undershootPercentage 0.00248839
elelFb_lRrightStdnorm 0.00242851
zigzagTrace_tremorPcY 0.00239065
elelFb_eRbotStd 0.00219817
elelFb_lRbotMedian 0.00203492
elelFb_eRrightQuartdelta 0.00159648
elelFb_eRleftStd 0.00097355
elelFb_eRleftMean 0.000692266
elelFb_eRtopStdnorm 0.000265293
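A ranking like the one above can be produced by pairing the feature names with the fitted forest’s impurity-based `feature_importances_` and sorting in descending order. A minimal sketch, using toy feature names rather than the real feature set:

```python
# Sketch: rank features by impurity-based importance, as in the list above.
# Toy data and feature names; the real report uses RF_model and its features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feat_{i}" for i in range(10)]
X, y = make_classification(n_samples=80, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# feature_importances_ is normalised to sum to 1 across features.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for name, imp in ranking:
    print(f"{name} {imp:.6f}")
```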

3 Multicollinear features

Show the code
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

X = x_train
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(24, 16))
corr = spearmanr(X).correlation

# Ensure the correlation matrix is symmetric
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)

# We convert the correlation matrix to a distance matrix before performing
# hierarchical clustering using Ward's linkage.
distance_matrix = 1 - np.abs(corr)
dist_linkage = hierarchy.ward(squareform(distance_matrix))
dendro = hierarchy.dendrogram(
    dist_linkage, labels=x_train.columns, ax=ax1, leaf_rotation=90
)
dendro_idx = np.arange(0, len(dendro["ivl"]))

ax2.imshow(corr[dendro["leaves"], :][:, dendro["leaves"]])
ax2.set_xticks(dendro_idx)
ax2.set_yticks(dendro_idx)
ax2.tick_params(axis='both', which='both', labelsize=8)
ax2.set_xticklabels(dendro["ivl"], rotation="vertical",)
ax2.set_yticklabels(dendro["ivl"])
fig.tight_layout()
plt.show()
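A common follow-up to this dendrogram (as in sklearn’s multicollinear-features example) is to cut it at a distance threshold and keep one representative feature per cluster. A sketch with toy near-collinear columns standing in for x_train, and a hypothetical cut height of 0.5:

```python
# Sketch: cut the Ward dendrogram and keep one feature per correlation cluster.
# Toy data: 6 columns, where columns 3-5 nearly duplicate columns 0-2.
from collections import defaultdict
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.column_stack([base, base + 0.001 * rng.normal(size=(100, 3))])

corr = spearmanr(X).correlation
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)
dist_linkage = hierarchy.ward(squareform(1 - np.abs(corr)))

# Cut at a (hypothetical) height; collinear columns land in the same cluster.
cluster_ids = hierarchy.fcluster(dist_linkage, t=0.5, criterion="distance")
clusters = defaultdict(list)
for idx, cid in enumerate(cluster_ids):
    clusters[cid].append(idx)
selected = [members[0] for members in clusters.values()]
print(sorted(selected))
```

Refitting the forest on only the selected features would then give permutation importances that are no longer washed out by collinear duplicates.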

4 Performance on training set

Show the code
# https://www.datacamp.com/tutorial/random-forests-classifier-python
import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = RF_model.predict(x_train)
y_pred_prob = RF_model.predict_proba(x_train)[:, 1]
train_accuracy = accuracy_score(y_train, y_pred)
train_roc_auc = roc_auc_score(y_train, y_pred_prob)

skplt.metrics.plot_roc(y_train, RF_model.predict_proba(x_train), title="Training data ROC")
plt.show()

5 Performance on blind test

Show the code
y_pred = RF_model.predict(x_test)
y_pred_prob = RF_model.predict_proba(x_test)[:, 1]
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)

skplt.metrics.plot_roc(y_test, RF_model.predict_proba(x_test), title="Test data ROC")
plt.show()

Show the code
skplt.metrics.plot_calibration_curve(y_test == "PD", [RF_model.predict_proba(x_test)], ["RF"])
plt.show()

Show the code
skplt.metrics.plot_confusion_matrix(y_test, y_pred)
plt.show()

Accuracy: 0.8 ROC AUC: 0.8

    thresholds  sensitivities  specificities
9        0.655         100.00          47.62
12       0.610          94.74          61.90
14       0.555          89.47          71.43
17       0.525          78.95          85.71
19       0.505          63.16          95.24
29       0.165          10.53         100.00
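Sensitivity/specificity pairs like those in the table can be read off the ROC curve: at each probability threshold, sensitivity is the true-positive rate and specificity is one minus the false-positive rate. A sketch with synthetic scores standing in for `y_test` and `y_pred_prob`:

```python
# Sketch: derive (threshold, sensitivity, specificity) rows from the ROC curve.
# Synthetic labels/scores stand in for the report's y_test and y_pred_prob.
import numpy as np
from sklearn.metrics import roc_curve

y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.60, 0.45, 0.90])

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
for thr, sens, spec in zip(thresholds, tpr * 100, (1 - fpr) * 100):
    print(f"{thr:8.3f}  sensitivity {sens:6.2f}  specificity {spec:6.2f}")
```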

6 The trees in the forest

The default ‘ensemble’ of 100 trees is used in Models.py. The model prediction for a new sample is the average of the results from applying every tree to it. This makes the forest more robust than any individual tree to data issues and dropouts.
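This averaging can be checked directly: `RandomForestClassifier.predict_proba` is the mean of the per-tree `predict_proba` over the fitted `estimators_`. A small self-contained check on synthetic data:

```python
# Verify that the forest probability is the mean over its individual trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=60, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Stack each tree's class probabilities and average across trees.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
assert np.allclose(per_tree.mean(axis=0), forest.predict_proba(X))
```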

6.1 Tree 1

Tree 1

6.2 Tree 2

Tree 2

6.3 Tree 3

Tree 3

6.4 Tree 4

Tree 4

6.5 Tree 5

Tree 5

6.6 Tree 6

Tree 6

6.7 Tree 7

Tree 7

6.8 Tree 8

Tree 8

6.9 Tree 9

Tree 9

6.10 Tree 10

Tree 10

6.11 Tree 11

Tree 11

6.12 Tree 12

Tree 12

6.13 Tree 13

Tree 13

6.14 Tree 14

Tree 14

6.15 Tree 15

Tree 15

6.16 Tree 16

Tree 16

6.17 Tree 17

Tree 17

6.18 Tree 18

Tree 18

6.19 Tree 19

Tree 19

6.20 Tree 20

Tree 20

6.21 Tree 21

Tree 21

6.22 Tree 22

Tree 22

6.23 Tree 23

Tree 23

6.24 Tree 24

Tree 24

6.25 Tree 25

Tree 25

6.26 Tree 26

Tree 26

6.27 Tree 27

Tree 27

6.28 Tree 28

Tree 28

6.29 Tree 29

Tree 29

6.30 Tree 30

Tree 30

6.31 Tree 31

Tree 31

6.32 Tree 32

Tree 32

6.33 Tree 33

Tree 33

6.34 Tree 34

Tree 34

6.35 Tree 35

Tree 35

6.36 Tree 36

Tree 36

6.37 Tree 37

Tree 37

6.38 Tree 38

Tree 38

6.39 Tree 39

Tree 39

6.40 Tree 40

Tree 40

6.41 Tree 41

Tree 41

6.42 Tree 42

Tree 42

6.43 Tree 43

Tree 43

6.44 Tree 44

Tree 44

6.45 Tree 45

Tree 45

6.46 Tree 46

Tree 46

6.47 Tree 47

Tree 47

6.48 Tree 48

Tree 48

6.49 Tree 49

Tree 49

6.50 Tree 50

Tree 50

6.51 Tree 51

Tree 51

6.52 Tree 52

Tree 52

6.53 Tree 53

Tree 53

6.54 Tree 54

Tree 54

6.55 Tree 55

Tree 55

6.56 Tree 56

Tree 56

6.57 Tree 57

Tree 57

6.58 Tree 58

Tree 58

6.59 Tree 59

Tree 59

6.60 Tree 60

Tree 60

6.61 Tree 61

Tree 61

6.62 Tree 62

Tree 62

6.63 Tree 63

Tree 63

6.64 Tree 64

Tree 64

6.65 Tree 65

Tree 65

6.66 Tree 66

Tree 66

6.67 Tree 67

Tree 67

6.68 Tree 68

Tree 68

6.69 Tree 69

Tree 69

6.70 Tree 70

Tree 70

6.71 Tree 71

Tree 71

6.72 Tree 72

Tree 72

6.73 Tree 73

Tree 73

6.74 Tree 74

Tree 74

6.75 Tree 75

Tree 75

6.76 Tree 76

Tree 76

6.77 Tree 77

Tree 77

6.78 Tree 78

Tree 78

6.79 Tree 79

Tree 79

6.80 Tree 80

Tree 80

6.81 Tree 81

Tree 81

6.82 Tree 82

Tree 82

6.83 Tree 83

Tree 83

6.84 Tree 84

Tree 84

6.85 Tree 85

Tree 85

6.86 Tree 86

Tree 86

6.87 Tree 87

Tree 87

6.88 Tree 88

Tree 88

6.89 Tree 89

Tree 89

6.90 Tree 90

Tree 90

6.91 Tree 91

Tree 91

6.92 Tree 92

Tree 92

6.93 Tree 93

Tree 93

6.94 Tree 94

Tree 94

6.95 Tree 95

Tree 95

6.96 Tree 96

Tree 96

6.97 Tree 97

Tree 97

6.98 Tree 98

Tree 98

6.99 Tree 99

Tree 99